PCI: hv: Reserve hv_pci swiotlb from buddy and publish via sysfs#260
benhillis wants to merge 1 commit into
Pull request overview
Reworks how the hv_pci driver provisions its dedicated swiotlb pool. Instead of accepting a host-supplied GPA (which could be reclaimed by Hyper-V page-reporting and triple-fault the guest), the driver now reserves a contiguous DMA32 range from the buddy allocator at core_initcall time and exposes the resulting base/size via /sys/bus/vmbus/drivers/hv_pci/swiotlb_{base,size} for a userspace agent to forward to the host.
Changes:
- Replaces <base>,<size> cmdline parsing with size-only hv_pci_swiotlb=<size>, 2 MiB aligned.
- Adds a core_initcall(hv_pci_swiotlb_alloc_pool) that uses alloc_contig_pages(GFP_KERNEL|__GFP_DMA32|__GFP_ZERO, …) to back the pool, gated by CONFIG_CONTIG_ALLOC with a no-op fallback.
- Adds DRIVER_ATTR_RO(swiotlb_base/size) sysfs files published after vmbus_driver_register() and removed on exit; backing pages are intentionally leaked on driver unload.
    /* UMA on WSL; first_online_node biases nothing in practice. */
    pages = alloc_contig_pages(nr_pages,
                               GFP_KERNEL | __GFP_DMA32 | __GFP_ZERO,
Any particular reason why we use __GFP_DMA32?
Same reason mainline swiotlb uses it: the bounce buffer must be reachable by 32-bit-DMA PCI devices, and ZONE_DMA32 is the safe universal home.
We don't have any 32-bit-DMA PCI devices.
It's probably OK to do this for now, though. We can relax this later if testing shows it's not necessary.
The old early_param parsed hv_pci_swiotlb=<base>,<size> and reserved the
host-supplied physical address with memblock_reserve(), which does not
validate that the range is backed by EPT. Under Hyper-V page-reporting
the backing for a nominally usable e820 range can be absent, so the
memset() inside swiotlb_init_io_tlb_pool() triple-faulted the guest.
Pick the base in the guest instead:
* A core_initcall calls alloc_contig_pages(__GFP_DMA32 | __GFP_ZERO)
for a kernel-owned, contiguous, below-4G range. __GFP_ZERO faults
the pages in, and kernel ownership keeps page reporting away. Gated
on CONFIG_CONTIG_ALLOC; without it the dedicated pool is skipped.
* The hv_pci_swiotlb= early_param and the alloc core_initcall are only
compiled in for built-in builds (#ifndef MODULE), because both
early_param and core_initcall are unavailable from modules. Module
builds compile cleanly and fall back to the default swiotlb pool.
* (base, size) is exposed via DRIVER_ATTR_RO under
/sys/bus/vmbus/drivers/hv_pci/swiotlb_{base,size} so userspace can
forward the real GPA to the host-side device backend.
swiotlb has no destroy_pool() counterpart, so the pages are leaked on
driver unload; hv_pci is rarely hot-replaced and the pool is bounded.
Signed-off-by: Ben Hillis <benhillis@microsoft.com>
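
The sysfs contract described in the commit message can be read from userspace. The following is a minimal sketch of what a consumer might do, not the actual agent. The struct and function names are hypothetical, and the value formats (hex base, decimal size) are assumed from the validation output quoted in this PR. The size == 0 and below-4G sanity checks are likewise assumptions, based on the pool being a DMA32 allocation.

```c
#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical container for the published range. */
struct pool_range {
	uint64_t base;	/* guest physical address of the pool */
	uint64_t size;	/* pool size in bytes */
};

/* Parse one sysfs value such as "0x8000000\n" or "67108864\n". */
static int parse_sysfs_u64(const char *text, uint64_t *out)
{
	char *end;
	unsigned long long v;

	errno = 0;
	v = strtoull(text, &end, 0);	/* base 0 accepts hex and decimal */
	if (errno || end == text)
		return -1;
	if (*end == '\n')
		end++;
	if (*end != '\0')
		return -1;
	*out = v;
	return 0;
}

/* Read a single-value sysfs attribute. */
static int read_sysfs_u64(const char *path, uint64_t *out)
{
	char buf[64];
	FILE *f = fopen(path, "r");

	if (!f)
		return -1;
	if (!fgets(buf, sizeof(buf), f)) {
		fclose(f);
		return -1;
	}
	fclose(f);
	return parse_sysfs_u64(buf, out);
}

/* Assumed sanity check: non-empty and entirely below 4 GiB. */
static int range_is_below_4g(const struct pool_range *r)
{
	const uint64_t limit = 1ULL << 32;

	return r->size != 0 && r->size <= limit && r->base <= limit - r->size;
}

/* Returns 0 if both attributes were read and the range looks sane. */
static int read_pool_range(struct pool_range *r)
{
	if (read_sysfs_u64("/sys/bus/vmbus/drivers/hv_pci/swiotlb_base", &r->base) ||
	    read_sysfs_u64("/sys/bus/vmbus/drivers/hv_pci/swiotlb_size", &r->size))
		return -1;
	return range_is_below_4g(r) ? 0 : -1;
}
```

A caller would invoke read_pool_range() once the hv_pci driver has registered and forward the resulting (base, size) to the host. How that forwarding happens is outside the scope of this PR.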
Summary
Reserve the dedicated hv_pci swiotlb pool from the buddy allocator at core_initcall time and publish the resulting (base, size) under /sys/bus/vmbus/drivers/hv_pci/swiotlb_{base,size} so userspace can forward the real GPA to the host-side device backend. This replaces the old "host dictates a GPA" flow.
Why
WSL container test runs intermittently saw the guest die with WorkerExitType=StoppedOnReset WorkerExitDetail=TripleFault WorkerExitInitiator=GuestOS between "io scheduler mq-deadline registered" and the next initcall. Root cause: memblock_reserve() accepts ranges that are not actually backed by EPT, and swiotlb_create_pool() -> swiotlb_init_io_tlb_pool() then memsets 64 MiB of unbacked pages.
What changed
- hv_pci_swiotlb=<size> is now the only accepted form. early_hv_pci_swiotlb() parses with memparse(p, &end) and rejects any unconsumed trailing characters with pr_warn, so the legacy <base>,<size> form (which memparse(p, NULL) would otherwise silently treat as just the leading hex base) is no longer accepted.
- core_initcall(hv_pci_swiotlb_alloc_pool) asks the buddy allocator for a contiguous DMA32 range via alloc_contig_pages(__GFP_DMA32 | __GFP_ZERO, first_online_node, &node_online_map). __GFP_ZERO faults every page in via the page allocator, so by the time swiotlb_create_pool() runs the memory is known-good. Kernel ownership keeps Hyper-V page reporting from yanking the backing. Running at core_initcall (initcall level 1) gives us the earliest possible shot at a fresh, mostly-empty DMA32 zone.
- Gated on CONFIG_CONTIG_ALLOC, with a no-op stub fallback.
- The hv_pci_swiotlb= early_param and the alloc core_initcall are wrapped in #ifndef MODULE, because both early_param and core_initcall are unavailable from modules (otherwise core_initcall redefines module_init's init_module). Module builds compile cleanly and fall back to the default swiotlb pool. The WSL config uses CONFIG_PCI_HYPERV=y, so the feature is active there.
- (base, size) is published via DRIVER_ATTR_RO once vmbus_driver_register() succeeds. An hv_pci_swiotlb_published flag makes publish()/unpublish() symmetric and idempotent, so a partial sysfs-create failure cleans up only what it created and module exit can't double-unpublish.
Validation
- scripts/checkpatch.pl --strict -g HEAD -> 0 errors, 0 warnings, 0 checks, 203 lines checked.
- make W=1 drivers/pci/controller/pci-hyperv.o is clean with both CONFIG_PCI_HYPERV=y and CONFIG_PCI_HYPERV=m (arm64 cross-build).
- swiotlb=force hv_pci_swiotlb=64M:
  - dmesg: hv_pci: reserved swiotlb pool [0x0000000008000000..0x000000000c000000)
  - /sys/bus/vmbus/drivers/hv_pci/swiotlb_base -> 0x8000000
  - /sys/bus/vmbus/drivers/hv_pci/swiotlb_size -> 67108864
Notes
swiotlb has no destroy_pool() counterpart to swiotlb_create_pool(), so the backing pages are deliberately leaked on driver unload. hv_pci is rarely hot-replaced and the pool is bounded (default 64 MiB).
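
The size-only parsing rule under "What changed" can be modeled in userspace. This is a sketch under stated assumptions, not the driver's code: parse_mem() is a hypothetical stand-in for the kernel's memparse() that handles only K/M/G suffixes, and parse_pool_size() is an illustrative name. Rejecting a misaligned size outright is also an assumption. The PR states the size is 2 MiB aligned but not whether the driver rounds or rejects.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define POOL_ALIGN (2ULL << 20)	/* 2 MiB */

/*
 * Userspace stand-in for the kernel's memparse(): parse a number with an
 * optional K/M/G suffix and report where parsing stopped through *end.
 */
static uint64_t parse_mem(const char *s, char **end)
{
	uint64_t v = strtoull(s, end, 0);
	unsigned int shift = 0;

	switch (**end) {
	case 'G': case 'g': shift = 30; break;
	case 'M': case 'm': shift = 20; break;
	case 'K': case 'k': shift = 10; break;
	default: break;
	}
	if (shift) {
		v <<= shift;
		(*end)++;	/* consume the suffix */
	}
	return v;
}

/* Returns 0 and stores the size on success, -EINVAL otherwise. */
static int parse_pool_size(const char *arg, uint64_t *size)
{
	char *end;
	uint64_t v;

	if (!arg || !*arg)
		return -EINVAL;

	v = parse_mem(arg, &end);

	/* The legacy "<base>,<size>" form leaves ",<size>" unconsumed. */
	if (end == arg || *end != '\0')
		return -EINVAL;

	/* Assumption: zero or non-2-MiB-aligned sizes are rejected. */
	if (v == 0 || (v & (POOL_ALIGN - 1)))
		return -EINVAL;

	*size = v;
	return 0;
}
```

With this model, "64M" yields 67108864 bytes, matching the swiotlb_size value in the validation output, while "0x8000000,64M" is rejected because parsing stops at the comma.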